Final Project - Report
INF 385T.9 - Data Wrangling Final Project: Does Positive Pop Music Portend Positive Attitudes?
Introduction
Investigation: Do countries with more upbeat number-one songs on pop music charts have
higher overall levels of happiness?
We will be working with data harvested from Spotify about popular music, Billboard Chart
Data from countries across the world, and the World Happiness Report to determine whether
countries that report higher happiness levels also listen to more upbeat music. Datasets
available online list metadata for thousands of popular songs, including songs that have topped
the charts in countries as far flung as Turkey, Brazil, and Japan.
By filtering large metadatasets of music to find matches with songs that hit number one
around the world, we hope to find trends among popular songs in each nation. We plan to look
at factors such as tempo, danceability, and energy to determine a song’s “upbeatness,” using
Billboard charts to determine the average “upbeatness” of popular songs in a given country, and
then visualizing the relative upbeatness of popular music in a country with Life Evaluation
Scores from the World Happiness Report to determine if there is a correlation between the
upbeatness of popular music in a given nation and that nation’s happiness.
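The approach described above can be sketched roughly in Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the upbeatness formula (an unweighted mean of danceability, energy, and tempo rescaled to a 0-1 range), the tempo bounds, the country labels, and every sample value are hypothetical and exist only to show the shape of the calculation.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample rows: (country, danceability, energy, tempo in BPM).
SONGS = [
    ("A", 0.80, 0.70, 120.0),
    ("A", 0.60, 0.90, 140.0),
    ("B", 0.50, 0.40, 90.0),
    ("B", 0.30, 0.60, 100.0),
    ("C", 0.70, 0.50, 110.0),
]

# Hypothetical Life Evaluation (ladder) scores per country.
LADDER = {"A": 7.2, "B": 5.1, "C": 6.0}

# Assumed bounds used to rescale tempo onto the same 0-1 range as the other features.
TEMPO_MIN, TEMPO_MAX = 60.0, 180.0


def upbeatness(danceability, energy, tempo):
    """Assumed score: unweighted mean of danceability, energy, and rescaled tempo."""
    tempo_scaled = (tempo - TEMPO_MIN) / (TEMPO_MAX - TEMPO_MIN)
    tempo_scaled = min(1.0, max(0.0, tempo_scaled))  # clamp out-of-range tempos
    return (danceability + energy + tempo_scaled) / 3


def mean_upbeatness_by_country(songs):
    """Average the per-song upbeatness scores within each country."""
    scores = defaultdict(list)
    for country, danceability, energy, tempo in songs:
        scores[country].append(upbeatness(danceability, energy, tempo))
    return {country: sum(vals) / len(vals) for country, vals in scores.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


by_country = mean_upbeatness_by_country(SONGS)
# Only countries present in both sources can be compared.
countries = sorted(set(by_country) & set(LADDER))
r = pearson([by_country[c] for c in countries], [LADDER[c] for c in countries])
print(f"Pearson r = {r:.3f}")
```

A different weighting of the features, or the use of valence, would change the scores; the sketch only shows how per-song features can be reduced to one number per country and compared with the ladder score.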
While we hoped to use Spotify’s API to harvest up-to-date metadata about currently
popular songs in various locales, Spotify unfortunately removed this feature from their API last
fall. However, ample archived datasets of thousands of songs exist on Kaggle and Github.
Between these datasets, Billboard Chart data, and the World Happiness Report, we hope to
uncover answers to the following questions:
Do happier countries listen to more upbeat, danceable music?
If so, will these trends appear regardless of socioeconomic factors like GDP per capita, or will we
need to control for wealth and human development?
Have these trends changed since 2023, when the analytic data was collected, or do the happier countries’ trends remain the same?
Datasets
Dataset 1: World Happiness Report - 2023
The World Happiness Report, published annually by the University of Oxford’s Wellbeing
Research Centre in partnership with Gallup, attempts to understand and analyze how people
experience and process happiness by using a questioning method called the Cantril Ladder. The
participants are asked the following question to rate their “happiness”:
“Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The
top of the ladder represents the best possible life for you and the bottom of the ladder
represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?” (Wellbeing Research Centre at the University of
Oxford & Gallup, 2025).
This question is carefully worded so that it can be universally translated and understood. On
average, more than 100,000 people from 140 different countries participate in the survey, and
data is collected throughout the year to accommodate differences in local factors.
For this project, the following columns will be extracted from the dataset:
Country name: The official name of the represented country.
Ladder score: The “happiness score” we are mainly interested in.
Log GDP per capita: Measures the average economic well-being per person using a logarithmic scale.
Social support: Indicates whether participants have someone they can rely on during difficult situations. The answers were logged using a binary system where 0 is no and 1 is yes.
Generosity: Reflects whether participants have donated any money to charity in the past month. Once again, the answers were logged using a binary system where 0 is no and 1 is yes.
Dataset 2: Spotify Global Top 50 Song Data (Kaggle)
This dataset, compiled by KEVINAM, was extracted from Spotify’s Top 50 Global Chart
(10/18/23 to 11/18/23) using the Spotify Web API. In addition to information about the
playlist, it includes key music analytics such as danceability, energy, loudness,
tempo, and the country where each song was ranked.
Danceability “describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity” (Spotify, 2025). Its values range from 0.0 to 1.0, where a higher value is considered more ‘danceable.’
Energy represents the “perceptual measure of intensity” and activity of the song and is also measured on a 0.0 to 1.0 scale (Spotify, 2025).
Loudness is the average decibel level of the track, typically ranging from -60 to 0 dB.
Valence measures the musical positiveness of a track, with higher scores indicating happier and more cheerful songs, and lower scores suggesting sad, angry, or depressed moods.
Tempo is the song’s speed represented by beats per minute (BPM).
Dataset 3: World Data 2023 (Kaggle)
This dataset will serve as a reference to match the country abbreviations from the Spotify
dataset to their formal country names. It was the most appropriate mapping source
since both this dataset and the Spotify dataset were compiled by the same author. From this dataset,
only the Country and Abbreviation columns will be extracted.
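The abbreviation-to-name mapping described above can be sketched as a simple lookup. This is an illustrative sketch only: the lookup entries, the row layout, and the field names ("name", "country", "country_name") are hypothetical stand-ins rather than the datasets' actual contents. It also sets aside rows whose code has no match, since silently dropping them would change row counts later on.

```python
# Hypothetical excerpt of the Country/Abbreviation lookup (abbreviation -> country name).
ABBREVIATION_TO_COUNTRY = {
    "NZ": "New Zealand",
    "KR": "South Korea",
    "US": "United States",
}

# Hypothetical chart rows; the field names are assumptions for illustration.
CHART_ROWS = [
    {"name": "Song 1", "country": "NZ"},
    {"name": "Song 2", "country": "kr"},  # inconsistent casing
    {"name": "Song 3", "country": ""},    # missing code
    {"name": "Song 4", "country": "XX"},  # code absent from the lookup
]


def add_country_name(rows, lookup):
    """Attach the full country name to each row; collect rows that cannot be matched."""
    matched, unmatched = [], []
    for row in rows:
        code = row["country"].strip().upper()  # normalize before looking up
        country_name = lookup.get(code)
        if country_name is None:
            unmatched.append(row)
        else:
            matched.append({**row, "country_name": country_name})
    return matched, unmatched


matched, unmatched = add_country_name(CHART_ROWS, ABBREVIATION_TO_COUNTRY)
print(len(matched), "rows matched;", len(unmatched), "rows left unmatched")
```

Inspecting the unmatched rows before discarding them makes it easier to spot codes that need cleaning rather than genuine gaps.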
Dataset 4: Aggregated Weekly Chart (Kworb)
To compare national happiness with current music trends, this dataset will include recent
popular music charts from kworb.net for selected countries. Because the website updates its charts on a daily and weekly basis, the chart from 10/16/2025 will be used for this project. Only the ‘Artist’ and ‘Title’ columns will be extracted. The selected countries are:
New Zealand
South Korea
USA
Dataset 5: Universal Top Spotify Songs (Kaggle)
This dataset compiles top trending songs up to June 2025. Although its author is different, this dataset has almost identical columns to Dataset 2, so the same columns will be extracted.
Danceability “describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity” (Spotify, 2025). Its values range from 0.0 to 1.0, where a higher value is considered more ‘danceable.’
Energy represents the “perceptual measure of intensity” and activity of the song and is also measured on a 0.0 to 1.0 scale (Spotify, 2025).
Loudness is the average decibel level of the track, typically ranging from -60 to 0 dB.
Valence measures the musical positiveness of a track, with higher scores indicating happier and more cheerful songs, and lower scores suggesting sad, angry, or depressed moods.
Tempo is the song’s speed represented by beats per minute (BPM).
Conclusion
top 10 and bottom 10 heatmap
2023-2025 comparison
Challenges
One of the most interesting challenges we faced was ensuring that R Tidyverse, Python Pandas, SQL, and Excel Power Query all produced exactly the same results. Since we wrote the R Tidyverse code first, we used its results as our baseline. However, when we tried to replicate the process in Python and SQL, each produced different results, and we were stumped as to why the number of rows did not match despite using identical datasets and logic.
After some trial and error, we discovered that the problem was caused by the way some titles were written. For Kworb’s Weekly Spotify Charts, as can be seen from the dataset screenshot above, we needed to split the ‘Artist and Title’ column into separate Artist and Title columns. To do this, we defined the hyphen as the delimiter and selected the first substring as the artist and the second substring as the title. This method worked perfectly for the majority of songs. However, titles like “Hotel California – 2013 remaster” contained additional hyphens, so only part of the title was kept (e.g. “Hotel California”). This caused join mismatches, since the 2025 Spotify dataset retained the full title.
To overcome this challenge, we needed a way to group every substring after the first delimiter into one. This was a relatively easy fix in Tidyverse and Pandas, as we only needed to include one additional parameter in the string-splitting function (see figures below). It was considerably more difficult in SQL. After thorough research, we found a solution using the SUBSTR() and STRPOS() functions. Rather than using STRING_SPLIT(), we used STRPOS() to locate the first delimiter in the ‘Artist and Title’ column and SUBSTR() to extract the text on either side of it, so that everything after the first delimiter was kept together as the title. The final code for this task in DuckDB SQL is shown below:
Once those changes were made, we obtained identical rows across all four tools and were able to proceed to the correlation analysis.
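The first-delimiter split discussed in this section can be illustrated with a short plain-Python sketch. It is not the project's Tidyverse, Pandas, or DuckDB code: the sample string, the " - " delimiter (hyphen with surrounding spaces), and the function names are assumptions chosen for illustration. The third function mirrors the position-based idea behind the SUBSTR()/STRPOS() approach by finding the first delimiter and slicing around it.

```python
DELIMITER = " - "  # assumed delimiter between artist and title


def split_naive(artist_and_title):
    """Split on every delimiter and keep only the first two pieces (the original bug)."""
    parts = artist_and_title.split(DELIMITER)
    title = parts[1] if len(parts) > 1 else ""
    return parts[0], title


def split_first_delimiter(artist_and_title):
    """Split only once, so everything after the first delimiter stays in the title."""
    parts = artist_and_title.split(DELIMITER, maxsplit=1)
    title = parts[1] if len(parts) > 1 else ""
    return parts[0], title


def split_by_position(artist_and_title):
    """Position-based variant: locate the first delimiter, then slice around it."""
    position = artist_and_title.find(DELIMITER)
    if position == -1:  # no delimiter present
        return artist_and_title, ""
    return artist_and_title[:position], artist_and_title[position + len(DELIMITER):]


row = "Eagles - Hotel California - 2013 Remaster"  # hypothetical chart entry
print(split_naive(row))            # ('Eagles', 'Hotel California')
print(split_first_delimiter(row))  # ('Eagles', 'Hotel California - 2013 Remaster')
print(split_by_position(row))      # ('Eagles', 'Hotel California - 2013 Remaster')
```

Because the truncated title no longer equals the full title stored in the other dataset, a join on title silently loses that row, which is consistent with the row-count mismatch described above.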